The dataset is taken from https://www.citibikenyc.com/system-data. It contains data about rides and has the following features: Trip Duration (seconds), Start Time and Date, Stop Time and Date, Start Station Name, End Station Name, Station ID, Station Lat/Long, Bike ID, User Type (Customer = 24-hour pass or 3-day pass user; Subscriber = Annual Member), Gender (0 = unknown; 1 = male; 2 = female), and Year of Birth.
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import datetime
%matplotlib inline
Load in your dataset and describe its properties through the questions below. Try and motivate your exploration goals through this section.
# One-time step: unzipping the downloaded files
# from zipfile import ZipFile
# # specifying the zip file name
# for year in np.arange(2014, 2017):
#     for month in np.arange(1, 13):
#         m_str = str(month)
#         if len(m_str) < 2:
#             m_str = "0" + m_str
#         file_name = str(year) + m_str + "-citibike-tripdata.zip"
#         with ZipFile(file_name, 'r') as zip:
#             zip.extractall()
# print('Done!')
# # trip = pd.read_csv("2017-fordgobike-tripdata.csv")
# One-time step: merging all monthly dataframes into one, to get data for the year 2014
# trip_all = pd.DataFrame()
# for year in np.arange(2014, 2015):
#     for month in np.arange(1, 13):
#         m_str = str(month)
#         if len(m_str) < 2:
#             m_str = "0" + m_str
#         file_name = str(year) + m_str + "-citibike-tripdata.csv"
#         # print(file_name)
#         if trip_all.empty:
#             trip_all = pd.read_csv(file_name)
#         else:
#             new_df = pd.read_csv(file_name)
#             # trip_all = trip_all.append(new_df, ignore_index=True)
#             print("Before "+file_name+" trip_all.shape: ", trip_all.shape[0], "new_df.shape: ", new_df.shape[0])
#             trip_all = pd.concat([trip_all, new_df], ignore_index=True)
#             print("After "+file_name+" trip_all.shape: ", trip_all.shape[0])
#         # df.head()
# print("Complete")
# Fetching a random sample of 10000 records only, as the data for the whole year is too big to process, and writing it to trip_sample_10000.csv
# samples = np.random.choice(trip_all.shape[0], 10000, replace = False)
# trip_samp = trip_all.loc[samples,:]
# trip_samp.to_csv("trip_sample_10000.csv", index=False)
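A reproducible variant of the sampling step could be sketched with `DataFrame.sample`. The small `trip_all` frame and the seed below are hypothetical stand-ins for illustration, not the notebook's actual data or settings.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the merged 2014 trips frame (made-up values).
trip_all = pd.DataFrame({
    "tripduration": np.arange(1000) + 60,
    "bikeid": np.arange(1000) % 50,
})

# Draw rows without replacement; a fixed seed (an assumption here)
# makes the sample repeatable across runs.
trip_samp = trip_all.sample(n=100, replace=False, random_state=42)
trip_samp_again = trip_all.sample(n=100, replace=False, random_state=42)

print(trip_samp.shape)
print(trip_samp.index.equals(trip_samp_again.index))
```

Fixing the seed means the saved sample file can be regenerated exactly if the merged data is rebuilt.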
trip = pd.read_csv("trip_sample_10000.csv")
trip.info()
trip.dropna(subset=['start station id','end station id', 'bikeid','start station name', 'end station name', 'usertype'], axis=0, inplace=True)
# convert date columns to datetime
for d_col in ['starttime', 'stoptime']:
    trip[d_col] = pd.to_datetime(trip[d_col])
trip.sort_values(by='starttime', inplace=True)
# convert integer ID columns to (unordered) categorical type
for int_col in ['start station id', 'end station id', 'bikeid']:
    trip[int_col] = pd.Categorical(trip[int_col], categories = trip[int_col].unique(), ordered=False)
trip.info()
trip.columns
trip = trip[['tripduration', 'starttime', 'stoptime', 'start station id'
,'end station id', 'bikeid', 'usertype','birth year', 'gender','start station name', 'end station name']]
def get_time_of_day(x):
    """Return the hour of a timestamp as a string, e.g. '8'."""
    return str(x.hour)
trip['starttime_hour'] = trip['starttime'].apply(get_time_of_day)
trip['starttime_day_name'] = trip['starttime'].dt.day_name()
trip['starttime_month_name'] = trip['starttime'].dt.month_name()
trip['starttime_date'] = trip['starttime'].dt.date
trip.shape[0]
dayname_order = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
trip['starttime_day_name'] = pd.Categorical(trip['starttime_day_name'], categories=dayname_order, ordered=True)
months_order = ['January','February','March','April','May','June', 'July','August','September','October', 'November','December']
trip['starttime_month_name'] = pd.Categorical(trip['starttime_month_name'], categories=months_order, ordered=True)
hours_order = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16",
"17", "18", "19", "20", "21", "22", "23"]
trip['starttime_hour'] = pd.Categorical(trip['starttime_hour'], categories=hours_order, ordered=True)
trip['tripduration_min'] = trip.tripduration/60
trip.info()
There are 10,000 records in the dataset with 16 features: 5 categorical features, 2 datetime features, 5 string features, 3 integer features, and 1 float feature.
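A tally like the one above can be read off `trip.info()`, or computed directly. The sketch below uses a hypothetical five-column miniature of the frame (made-up values), so its counts are not the notebook's.

```python
import pandas as pd

# Hypothetical miniature of the cleaned frame (values are made up).
demo = pd.DataFrame({
    "tripduration": [300, 900],
    "starttime": pd.to_datetime(["2014-01-01 08:00", "2014-01-02 17:30"]),
    "bikeid": pd.Categorical([101, 102]),
    "usertype": ["Subscriber", "Customer"],
    "tripduration_min": [5.0, 15.0],
})

# Count how many columns fall under each dtype.
dtype_counts = demo.dtypes.astype(str).value_counts()
print(demo.shape)
print(dtype_counts)
```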
I want to find out how the number of bike rides is affected by user type and time of ride.
I expect the time of the ride (starttime_hour, starttime_day_name, starttime_month_name) to have a major impact on the number of rides. I expect Customers to ride less frequently than Subscribers.
In this section, investigate distributions of individual variables. If you see unusual points or outliers, take a deeper look to clean things up and prepare yourself to look at relationships between variables.
trip.columns
# Plotting distribution of tripduration_min
bins = np.arange(0, 1400, 5)
plt.hist(data=trip, x='tripduration_min', bins=bins);
Most of the data sits at the far left, suggesting some strong outliers on the right. Let's identify these outliers and see whether they need to be filtered out of the data.
duration_df = trip.query('tripduration_min > 100')[['starttime', 'stoptime','tripduration_min','tripduration']]
duration_df['stoptime'].sub(duration_df['starttime'])/ np.timedelta64(1, 'm') - duration_df['tripduration_min']
duration_df['tripduration_hour'] = duration_df['tripduration_min']/60
duration_df
All the high outliers appear to be valid points. Some have extremely high durations of 5 to 23 hours, which could be due to a system error or some other issue. These outliers are removed for consistency.
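One way to check whether long trips are internally consistent is to compare the recorded duration with the time elapsed between the start and stop timestamps. The sketch below uses made-up rows, and the one-minute tolerance is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical long trips (made-up values); tripduration is in seconds.
long_trips = pd.DataFrame({
    "starttime": pd.to_datetime(["2014-03-01 09:00:00", "2014-07-04 10:00:00"]),
    "stoptime": pd.to_datetime(["2014-03-01 14:30:00", "2014-07-05 09:00:00"]),
    "tripduration": [19800, 82800],
})

# Duration implied by the timestamps, in minutes.
elapsed_min = (long_trips["stoptime"] - long_trips["starttime"]).dt.total_seconds() / 60
recorded_min = long_trips["tripduration"] / 60

# A row is treated as consistent when both values agree to within
# one minute (the tolerance is an assumption).
long_trips["consistent"] = np.isclose(elapsed_min, recorded_min, atol=1.0)
print(long_trips)
```

A consistent row only shows that the record agrees with itself; it does not say whether the bike was actually ridden for that long.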
# remove outlier values where tripduration_min >= 100 min
trip = trip.query('tripduration_min < 100')
trip.shape[0]
# Plotting distribution of tripduration_min
bin_size = 1
bins = np.arange(0, 100+bin_size, bin_size)
plt.hist(data=trip, x='tripduration_min', bins=bins);
# tripduration_min has a long tail on the right, plotting it on log scale
# Plotting distribution of tripduration_min on log scale
log_bin_size = 0.05
bins = 10**np.arange(0, np.log10(trip['tripduration_min'].max())+log_bin_size, log_bin_size)
plt.hist(data=trip, x='tripduration_min', bins=bins)
plt.xscale('log')
x_ticks = [1, 2, 5,10,20,50,100]
plt.xticks(x_ticks, x_ticks)
plt.xlabel('trip duration (min)')
tripduration_min has a long-tailed distribution: 45 trips last more than 100 minutes and the remaining 9955 trips last less than 100 minutes. When plotted on a log scale, the trip duration distribution looks unimodal, with a peak between 5 and 20 minutes.
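The location of the peak can also be found numerically from the same log-spaced bins. This is a minimal sketch on synthetic log-normal durations (an assumption made purely for illustration), not on the trip data itself.

```python
import numpy as np

# Hypothetical trip durations in minutes, drawn from a log-normal
# distribution purely for illustration.
rng = np.random.default_rng(0)
durations = rng.lognormal(mean=np.log(10), sigma=0.6, size=5000)
durations = durations[(durations >= 1) & (durations < 100)]

# Bin edges evenly spaced in log10 units, as in the histogram above.
log_bin_size = 0.05
edges = 10 ** np.arange(0, np.log10(durations.max()) + log_bin_size, log_bin_size)
counts, _ = np.histogram(durations, bins=edges)

# Location of the tallest bin on the original (minutes) scale.
peak = counts.argmax()
print(f"peak bin: {edges[peak]:.1f} to {edges[peak + 1]:.1f} min")
```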
# Plotting hourly, weekly, monthly trend of number of rides
fig, ax = plt.subplots(nrows=3, figsize = [7,12])
base_color = sb.color_palette()[0]
cat_var = ['starttime_hour','starttime_day_name', 'starttime_month_name']
for n in np.arange(0, len(cat_var)):
    col = cat_var[n]
    sb.countplot(data=trip, x=col, color=base_color, ax = ax[n]);
    # rotate and label each subplot, not only the last one
    ax[n].tick_params(axis='x', rotation=45)
    ax[n].set_xlabel(col)
    ax[n].set_ylabel('counts')
# Plotting percentage of total number of rides for each usertype
plt.figure(figsize=[7,4])
# rename_axis / reset_index(name=...) keep the column names stable across pandas versions
user_df = trip['usertype'].value_counts(normalize=True).mul(100).round(1).rename_axis('usertype').reset_index(name='percentage_rides')
plt.barh(user_df['usertype'], user_df['percentage_rides'], color=base_color);
plt.ylabel('User Type')
plt.xlabel('Percentage of Rides')
user_df.set_index('usertype')
Next, I will analyze the number of rides starting from and ending at each station.
# Plotting total number of outgoing rides for each station in descending order
plt.figure(figsize=[25, 100])
plt.rcParams['xtick.top'] = plt.rcParams['xtick.labeltop'] = True
plt.subplot(1,2,1)
col = 'start station name'
start_station_order = trip['start station name'].value_counts().index
sb.countplot(data=trip, y=col, color=sb.color_palette()[0], order=start_station_order);
plt.ylabel(col)
plt.xlabel("count")
plt.axvline(90)
# Plotting total number of incoming rides for each station in descending order
plt.subplot(1,2,2)
col = 'end station name'
end_station_order = trip['end station name'].value_counts().index
sb.countplot(data=trip, y=col, color=sb.color_palette()[0], order=end_station_order);
plt.ylabel(col)
plt.xlabel("count")
plt.axvline(90)
plt.rcParams['xtick.top'] = plt.rcParams['xtick.labeltop'] = False
# count of outgoing rides per start station
starting_rides = trip.groupby('start station name').agg({'bikeid':'count'}).reset_index().rename(columns={'bikeid': 'start station bike count'})
# count of incoming rides per end station
ending_rides = trip.groupby('end station name').agg({'bikeid':'count'}).reset_index().rename(columns={'bikeid': 'end station bike count'})
# combine both counts per station; stations that never appear as a start station are dropped
bikes_count_df = pd.merge(left=starting_rides, right=ending_rides, how='outer', left_on='start station name', right_on='end station name')
bikes_count_df.dropna(subset=['start station name'], inplace=True, axis=0)
bikes_count_df['end station bike count'] = bikes_count_df['end station bike count'].fillna(0)
bikes_count_df['total_bike_count'] = bikes_count_df['start station bike count'] + bikes_count_df['end station bike count']
# ascending sort, so the busiest station is drawn at the top of the horizontal bar chart
bikes_count_df.sort_values('total_bike_count', ascending=True, inplace=True)
# Plotting total number of rides for each station in descending order
fig, axes = plt.subplots(ncols=2, sharey=True, figsize=[50, 150])
# plt.rcParams['xtick.top'] = plt.rcParams['xtick.labeltop'] = True
y = np.arange(bikes_count_df['start station name'].shape[0])
axes[0].barh(y, bikes_count_df['end station bike count'])
axes[1].barh(y, bikes_count_df['start station bike count'])
axes[0].invert_xaxis()
axes[0].set(yticks=y, yticklabels=bikes_count_df['start station name'])
axes[0].yaxis.tick_right()
for ax in axes:
    ax.tick_params(labelsize=25)
fig.subplots_adjust(wspace=0.29)
for ax in axes.flat:
    ax.margins(0)
    ax.grid(True)
# plt.rcParams['xtick.top'] = plt.rcParams['xtick.labeltop'] = False
bikes_count_df.tail(10).sort_values('total_bike_count', ascending=False)[['start station name','start station bike count','end station bike count']].set_index('start station name')
bikes_count_df.head(10)[['start station name','start station bike count','end station bike count']].set_index('start station name')
tripduration_min has a long-tailed distribution. When plotted on a log scale, the trip duration distribution looks unimodal, with a peak between 5 and 20 minutes. A few data points with unusually high trip durations were removed from the dataset.
Most of the trips are made on weekdays (Monday to Friday), probably because these are working days. However, the difference between weekdays and weekends is not very large, which needs further investigation.
The Time of Day vs. Count plot is mainly bimodal, with peaks between 8 and 10 hours and between 15 and 20 hours, plus a small peak at 12 hours. People appear to ride while travelling to the office and back home; the small peak at 12 hours needs further investigation.
User type: 90.3% of the rides are taken by Subscribers and 9.7% by Customers, a huge difference between the two.
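Since there are five weekdays but only two weekend days, one way to follow up on the weekday versus weekend comparison is to average the rides per calendar day in each group. The sketch below uses made-up start times; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical ride start times (made up): one ride every 6 hours
# for two weeks, starting on Monday 2014-06-02.
starts = pd.Series(
    pd.Timestamp("2014-06-02") + pd.to_timedelta(np.arange(56) * 6, unit="h")
)

# Rides per calendar date, then the mean of those daily totals
# for weekdays versus weekends (dayofweek: 5 = Saturday, 6 = Sunday).
daily = starts.groupby(starts.dt.date).size()
is_weekend = pd.to_datetime(daily.index).dayofweek >= 5
per_day = daily.groupby(is_weekend).mean()
per_day.index = per_day.index.map({False: "weekday", True: "weekend"})
print(per_day)
```

Comparing per-day averages avoids reading a difference into the totals that comes only from the number of days in each group.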
Let's look at the relationships between the categorical variables.
# Plotting hourly, weekly, monthly trend of number of rides for each user type
plt.figure(figsize = [8, 15])
plt.subplot(3, 1, 1)
sb.countplot(data = trip, x = 'starttime_hour', hue = 'usertype', palette = 'Blues')
ax = plt.subplot(3, 1, 2)
sb.countplot(data = trip, x = 'starttime_day_name', hue = 'usertype', palette = 'Blues')
plt.xticks(rotation=35)
ax = plt.subplot(3, 1, 3)
sb.countplot(data = trip, x = 'starttime_month_name', hue = 'usertype', palette = 'Blues')
plt.xticks(rotation=45)
plt.show()
Let's look at how tripduration_min relates to hour, day, and month.
# Plotting hourly, weekly, and monthly trends for tripduration_min
fig, ax = plt.subplots(ncols = 1, nrows = 3 , figsize = [10,12])
categoric_vars = ['starttime_hour','starttime_day_name','starttime_month_name']
for i in range(len(categoric_vars)):
    var = categoric_vars[i]
    sb.boxplot(data=trip, x=var, y='tripduration_min', color=base_color, ax = ax[i])
    # rotate the tick labels of each subplot, not only the last one
    ax[i].tick_params(axis='x', rotation=45)
Plotting 'starttime_hour', 'starttime_day_name', and 'starttime_month_name' against log_tripduration_min to analyze the relationship better.
def log_trans(x, inverse = False):
    """ quick function for computing log and power operations """
    if not inverse:
        return np.log10(x)
    else:
        return np.power(10, x)
# creating log_tripduration_min column for ease of analysis
trip['log_tripduration_min'] = trip['tripduration_min'].apply(log_trans)
# Plotting hourly, weekly, and monthly trends for tripduration_min on log scale
fig, ax = plt.subplots(ncols = 1, nrows = 3 , figsize = [10,12])
categoric_vars = ['starttime_hour','starttime_day_name','starttime_month_name']
for i in range(len(categoric_vars)):
    var = categoric_vars[i]
    sb.boxplot(data=trip, x=var, y='log_tripduration_min', color=base_color, ax = ax[i])
    # label the log10 axis with the original values in minutes
    y_ticks = [1,2,5,10,20,50,100]
    ax[i].set_yticks(log_trans(np.array(y_ticks)))
    ax[i].set_yticklabels(y_ticks)
    ax[i].set_ylabel("Trip duration (min)")
    ax[i].tick_params(axis='x', rotation=45)
# Plotting tripduration_min Vs usertype
var = 'usertype'
sb.boxplot(data=trip, y=var, x='log_tripduration_min', color=base_color)
x_ticks = [1,2,5,10,20,50,100]
plt.xticks(log_trans(np.array(x_ticks)),x_ticks);
plt.xlabel('Trip duration (min)');
# Plotting total number of rides for each station in descending order for Subscriber and Customer
plt.figure(figsize=[25, 100])
plt.rcParams['xtick.top'] = plt.rcParams['xtick.labeltop'] = True
plt.subplot(1,2,1)
col = 'start station id'
start_station_order = trip[trip['usertype']=='Subscriber']['start station id'].value_counts().index
sb.countplot(data=trip, y=col, hue = 'usertype', order=start_station_order, palette = 'Blues');
plt.ylabel(col)
plt.xlabel("count")
plt.axvline(90)
plt.subplot(1,2,2)
col = 'end station id'
end_station_order = trip[trip['usertype']=='Subscriber']['end station id'].value_counts().index
sb.countplot(data=trip, y=col, hue = 'usertype', order=end_station_order, palette = 'Blues');
plt.ylabel(col)
plt.xlabel("count")
plt.axvline(90)
plt.rcParams['xtick.top'] = plt.rcParams['xtick.labeltop'] = False
Trip duration vs. Month: trip duration is longer in the months from April to October.
Trip duration vs. Day of Week: trip duration is longer on Saturdays and Sundays. In the univariate analysis, fewer trips were taken on Saturdays and Sundays. Combining the two, more short trips are taken on weekdays and more long trips are taken on weekends.
Trip duration vs. Time of Day: surprisingly, trip duration is at its maximum at 3 hours and plunges to its minimum at 4 hours. This is an area to be investigated.
The starttime_hour vs. count plot is bimodal for Subscribers but unimodal for Customers.
The count of rides by Subscribers decreases on weekends, whereas it increases for Customers.
The month-wise count of rides follows a similar trend for both Subscribers and Customers.
Trip duration is longer on Saturdays and Sundays. In the univariate analysis, fewer trips were taken on Saturdays and Sundays. Combining the two, more short trips are taken on weekdays and more long trips are taken on weekends.
Surprisingly, trip duration is at its maximum at 3 hours and plunges to its minimum at 4 hours.
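The weekday versus weekend observation combines two quantities, ride count and typical duration, which can be put side by side in one table. The sketch below uses a handful of made-up trips; `is_weekend` is a hypothetical helper column.

```python
import pandas as pd

# Hypothetical trips (made-up values); 2014-06-02 is a Monday.
demo = pd.DataFrame({
    "starttime": pd.to_datetime([
        "2014-06-02 08:10", "2014-06-03 17:45", "2014-06-04 08:30",
        "2014-06-07 14:00", "2014-06-08 15:20",
    ]),
    "tripduration_min": [8.0, 10.0, 9.0, 25.0, 31.0],
})

# dayofweek is 0 for Monday through 6 for Sunday.
demo["is_weekend"] = demo["starttime"].dt.dayofweek >= 5

# Ride count and median duration for weekdays versus weekends.
summary = demo.groupby("is_weekend")["tripduration_min"].agg(["count", "median"])
print(summary)
```

The median is used here because trip durations are long-tailed, so a few long rides would pull the mean upward.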
Create plots of three or more variables to investigate your data even further. Make sure that your investigations are justified, and follow from your work in the previous sections.
trip.columns
# Plotting Hourly trend for each day of week and usertype
g=sb.FacetGrid(data=trip, row="starttime_day_name", col="usertype", height=3, aspect=1.6, margin_titles=True);
g.map(sb.countplot, 'starttime_hour', order=hours_order);
Subscriber: on weekdays, the plot has an additional peak at 12 hours, which suggests people are probably riding for lunch. On Fridays, however, the peak is at 13 hours, along with a reduction in frequency at 17 hours. This suggests that fewer rides are taken on Friday evenings than on other weekday evenings; people are probably travelling to places outside their regular routine, using other modes of transport. On Saturdays, Subscriber rides are spread almost uniformly between 8 and 21 hours, with some spikes at 10, 12, 15, and 17 hours, suggesting that people may be travelling to extracurricular classes or activities.
Customer: Customers take very few rides on weekdays and appear to ride on a need basis. On weekends, the number of rides is comparatively greater, with a peak at 14 hours.
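Peak hours like those read off the facet grid can also be tabulated. The following is a minimal sketch with made-up rides; the `day` and `hour` columns are hypothetical stand-ins for `starttime_day_name` and `starttime_hour`.

```python
import pandas as pd

# Hypothetical rides (made-up values); 2014-06-02 is a Monday.
demo = pd.DataFrame({
    "starttime": pd.to_datetime([
        "2014-06-02 08:05", "2014-06-02 12:15", "2014-06-02 17:40",
        "2014-06-06 13:10", "2014-06-07 14:30", "2014-06-07 14:50",
    ]),
    "usertype": ["Subscriber", "Subscriber", "Subscriber",
                 "Subscriber", "Customer", "Customer"],
})
demo["day"] = demo["starttime"].dt.day_name()
demo["hour"] = demo["starttime"].dt.hour

# Rides per (user type, day) row and start-hour column.
table = pd.crosstab([demo["usertype"], demo["day"]], demo["hour"])

# Busiest start hour for each user type and day
# (ties resolve to the earliest hour).
busiest_hour = table.idxmax(axis=1)
print(busiest_hour)
```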
# Plotting Hourly trend for each month and usertype
g=sb.FacetGrid(data=trip, row="starttime_month_name", col='usertype', height=3, aspect=1.6, margin_titles=True)
g.map(sb.countplot, 'starttime_hour',order=hours_order)
Across months, the trend in rides follows the weather cycle, with the fewest rides in February. For Customers, there is mostly a spike at 15 hours, but in July the spike shifts to 19 hours. This may be attributed to high temperatures in July, which lead people to prefer riding in the evening.
short_month_list = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
short_dayname_order = ['Mon', 'Tue', 'Wed', 'Thur', 'Fri', 'Sat', 'Sun']
# Plotting Weekly trend for each month and usertype
g=sb.FacetGrid(data=trip, row="starttime_month_name",col='usertype', height=3, aspect=1.6, margin_titles=True)
g.map(sb.countplot, "starttime_day_name", order=dayname_order)
# apply the short day labels to every facet, not only the last axis
g.set_xticklabels(short_dayname_order);
plt.tight_layout()
def maxofdf(df1,df2):
    return max(df1.max().max(),df2.max().max())
def minofdf(df1,df2):
    return min(df1.min().min(),df2.min().min())
# Preparing data for heatmap for number of rides for Subscriber
result1 = trip[trip["usertype"]=="Subscriber"].groupby(["starttime_month_name","starttime_day_name"]).agg({"bikeid":'count'}).reset_index().rename(columns={"starttime_month_name":"Ride Start Month", "starttime_day_name":"Ride Start Day", "bikeid":"Count"})
result1 = result1.pivot(index="Ride Start Month", columns="Ride Start Day", values="Count")
# Preparing data for heatmap for number of rides for Customer
result2 = trip[trip["usertype"]=="Customer"].groupby(["starttime_month_name","starttime_day_name"]).agg({"bikeid":'count'}).reset_index().rename(columns={"starttime_month_name":"Ride Start Month", "starttime_day_name":"Ride Start Day", "bikeid":"Count"})
result2 = result2.pivot(index="Ride Start Month", columns="Ride Start Day", values="Count")
# Preparing data for heatmap for tripduration for Subscriber
trip_dur_year_s = trip[trip["usertype"]=="Subscriber"].groupby(["starttime_month_name","starttime_day_name"]).agg({"tripduration_min":'sum'}).reset_index().rename(columns={"starttime_month_name":"Ride Start Month", "starttime_day_name":"Ride Start Day", "tripduration_min":"Ride Duration(min)"})
trip_dur_year_s = trip_dur_year_s.pivot(index="Ride Start Month", columns="Ride Start Day", values="Ride Duration(min)")
# Preparing data for heatmap for tripduration for Customer
trip_dur_year_c = trip[trip["usertype"]=="Customer"].groupby(["starttime_month_name","starttime_day_name"]).agg({"tripduration_min":'sum'}).reset_index().rename(columns={"starttime_month_name":"Ride Start Month", "starttime_day_name":"Ride Start Day", "tripduration_min":"Ride Duration(min)"})
trip_dur_year_c = trip_dur_year_c.pivot(index="Ride Start Month", columns="Ride Start Day", values="Ride Duration(min)")
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(12,10))
# Plotting heatmap for number of rides for Subscriber
sb.heatmap(result1, cmap="viridis_r", ax=ax[0,0], fmt="",vmin=minofdf(result1, result2),vmax=maxofdf(result1, result2), cbar=False,xticklabels=False)
ax[0,0].set_title("Subscriber Count of Rides")
# Plotting heatmap for number of rides for Customer
sb.heatmap(result2, cmap="viridis_r", ax=ax[0,1], fmt="",vmin=minofdf(result1, result2),vmax=maxofdf(result1, result2), yticklabels=False,xticklabels=False)
ax[0,1].set_title("Customer Count of Rides")
# Plotting heatmap for tripduration for Subscriber
sb.heatmap(trip_dur_year_s, cmap="viridis_r", ax=ax[1,0], fmt="",vmin=minofdf(trip_dur_year_s, trip_dur_year_c),vmax=maxofdf(trip_dur_year_s, trip_dur_year_c),cbar=False)
ax[1,0].set_title("Subscriber Trip duration(min)")
# Plotting heatmap for tripduration for Customer
sb.heatmap(trip_dur_year_c, cmap="viridis_r", ax=ax[1,1], fmt="",vmin=minofdf(trip_dur_year_s, trip_dur_year_c),vmax=maxofdf(trip_dur_year_s, trip_dur_year_c),yticklabels=False)
ax[1,1].set_title("Customer Trip duration(min)")
plt.tight_layout()
For Subscribers, total trip duration on weekdays is mostly greater than on weekends, while for Customers, total trip duration on weekdays is mostly less than on weekends.
Subscriber: trip duration and count of rides are greater for the months April to October and for Monday to Friday. In July, trip duration and count of trips are at their maximum on Tuesday, Wednesday, and Thursday, suggesting that these days might include holidays. A similar but less prominent trend is observed for January (Monday, Thursday), May (Wednesday, Thursday, Friday), August (Thursday, Friday, Saturday), September (Monday, Tuesday, Wednesday), October (Friday), and April (Thursday).
Customer: interestingly, high trip durations and trip counts are observed on Saturdays and Sundays in June and August. A similar but less prominent trend is observed for Mondays in September and May.
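Because the duration heatmaps show summed minutes, a cell can be high either because many rides were taken or because the rides were long. One possible follow-up, sketched here on made-up data with hypothetical column names, is to divide the total-duration grid by the count grid to get the average minutes per ride.

```python
import pandas as pd

# Hypothetical trips for one user type (made-up values).
demo = pd.DataFrame({
    "month": ["June", "June", "June", "July", "July"],
    "day": ["Monday", "Monday", "Saturday", "Monday", "Saturday"],
    "tripduration_min": [10.0, 14.0, 30.0, 12.0, 20.0],
})

# One grid per statistic: ride count and total minutes per (month, day).
counts = demo.pivot_table(index="month", columns="day",
                          values="tripduration_min", aggfunc="count")
totals = demo.pivot_table(index="month", columns="day",
                          values="tripduration_min", aggfunc="sum")

# Dividing the grids gives the average minutes per ride in each cell,
# which separates "many rides" from "long rides".
mean_per_ride = totals / counts
print(mean_per_ride)
```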
# Plotting Date Vs total rides count
plt.figure(figsize=[20,10])
date_usertype_df= trip.groupby(['starttime_date','usertype']).agg({'bikeid':'count'}).reset_index().rename(columns={'bikeid':'count'})
ax = sb.pointplot(x='starttime_date', y='count', hue='usertype', palette='viridis_r', scale=.5, data=date_usertype_df)
Let's redraw the above plot with monthly total rides on the y-axis.
# Overall Trend : Month Vs total rides count
plt.figure(figsize=[15,7])
month_usertype_df= trip.groupby(['starttime_month_name','usertype']).agg({'bikeid':'count'}).reset_index().rename(columns={'bikeid':'count'})
ax = sb.pointplot(x='starttime_month_name', y='count', hue='usertype', palette='viridis_r', scale=.7, data=month_usertype_df)
For Subscribers, the total number of rides decreases in June, then increases again in July. For Customers, the total number of rides decreases in July, then increases again in August.
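Dips like these can be located programmatically by differencing the monthly counts. The numbers below are made up for illustration and are not the notebook's results.

```python
import pandas as pd

# Hypothetical monthly ride counts per user type (made-up values).
monthly = pd.DataFrame(
    {"Subscriber": [700, 650, 720, 740], "Customer": [90, 110, 80, 120]},
    index=["May", "June", "July", "August"],
)

# Change relative to the previous month; negative values mark a dip.
change = monthly.diff()
dips = change < 0
print(change)
print(dips)
```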
For Subscribers, on weekdays, the plot had an additional peak at 12 hours, probably because people are riding for lunch. On Fridays, however, the peak is at 13 hours, along with a reduction in frequency at 17 hours. This suggests that fewer rides are taken on Friday evenings than on other weekday evenings; people are probably travelling to places outside their regular routine, using other modes of transport. On Saturdays, Subscriber rides are spread almost uniformly between 8 and 21 hours, with some spikes at 10, 12, 15, and 17 hours, suggesting that people may be travelling to extracurricular classes or activities.
Customers take very few rides on weekdays and appear to ride on a need basis. On weekends, the number of rides is comparatively greater, with a peak at 14 hours.
For Subscribers, the total number of rides decreases in June, then increases again in July. For Customers, the total number of rides decreases in July, then increases again in August.
Trip duration reinforced the count-of-rides feature, giving an overall view of the trend.
The interaction between trip duration and count of trips gives an interesting insight into some special days of the year when both trip duration and count of trips were at the high end. Such days may be holidays or bike riding events.
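One way such special days might be flagged is to mark dates on which both the ride count and the total duration are unusually high. This is a sketch on made-up daily totals; the 80th-percentile threshold is an assumption chosen for this small example.

```python
import pandas as pd

# Hypothetical per-day totals (made-up values).
daily = pd.DataFrame(
    {
        "rides": [20, 22, 21, 60, 23, 19],
        "total_min": [250.0, 270.0, 300.0, 900.0, 260.0, 240.0],
    },
    index=pd.to_datetime([
        "2014-07-01", "2014-07-02", "2014-07-03",
        "2014-07-04", "2014-07-05", "2014-07-06",
    ]),
)

# Flag days where both measures exceed their own 80th percentile
# (the threshold is an assumption chosen for this small example).
threshold = daily.quantile(0.8)
special = daily[(daily["rides"] > threshold["rides"])
                & (daily["total_min"] > threshold["total_min"])]
print(special)
```

Flagged dates would still need to be checked against a calendar of holidays or events before drawing any conclusion.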